CI: Use per-machine runner labels for GPU load test #299

gyohuangxin wants to merge 7 commits into main
Conversation
Pull request overview
Updates the manual GPU load test workflow to target specific self-hosted runner labels per physical machine, avoiding duplicate runs caused by a shared runner label.
Changes:
- Replaced the single shared runner label in the default (“all”) runner matrix with a fixed list of per-machine runner labels.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
    if [ -n "$containers" ]; then
      docker kill $containers || true
    fi
    docker rm -f ${{ env.CONTAINER_NAME }} 2>/dev/null || true
    docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
The cleanup step no longer removes an existing ${{ env.CONTAINER_NAME }} container. Since docker ps -q only lists running containers, a previously-stopped gpu_load_test container could remain and make the later docker run -dt --name ${{ env.CONTAINER_NAME }} fail with a name conflict. Consider explicitly docker rm -f ${{ env.CONTAINER_NAME }} (or removing all containers you kill) as part of this step.
```diff
  INPUT="${{ inputs.runners || 'all' }}"
  if [ "$INPUT" = "all" ]; then
-   MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+   MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
This workflow now uses per-machine runner labels for the default all case, but .github/runner-config.yml still only documents the old shared atom-mi355-8gpu.predownload label and indicates it should be updated when CI runner labels change. To keep the runner→hardware mapping (and any dependent dashboards) accurate, add entries for the new mia1-p01-g* labels there (or otherwise update the mapping source of truth).
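To illustrate the kind of update the comment asks for, entries along these lines could be added to .github/runner-config.yml. The field names here are guesses, since the file's actual schema is not shown in this diff; mirror the shape of the existing atom-mi355-8gpu.predownload entry.

```yaml
# Hypothetical additions to .github/runner-config.yml -- field names are
# illustrative; copy the structure of the existing entries in that file.
runners:
  - label: mia1-p01-g33
    gpu: MI355-8GPU
  - label: mia1-p01-g34
    gpu: MI355-8GPU
  - label: mia1-p01-g40
    gpu: MI355-8GPU
  - label: mia1-p01-g42
    gpu: MI355-8GPU
  - label: mia1-p01-g45
    gpu: MI355-8GPU
  - label: mia1-p01-g64
    gpu: MI355-8GPU
```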
```diff
  MODEL_NAME="deepseek-ai/DeepSeek-R1-0528"
- MODEL_LOCAL_PATH="/data/deepseek-ai/DeepSeek-R1-0528"
+ MODEL_LOCAL_PATH="/models/deepseek-ai/DeepSeek-R1-0528"
  TENSOR_PARALLEL=8
  KV_CACHE_DTYPE="fp8"
  TEMPERATURE=0
```
MODEL_LOCAL_PATH was changed to /models/..., but the script’s docker run later only bind-mounts /data into the container. If the model is found locally, MODEL_PATH will be a /models/... path that won’t exist inside the container, causing the inference command to fail. Either mount /models into the container as well, or add the same /models→/data fallback logic used in the GitHub Actions workflow.
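A minimal sketch of the "mount /models as well" option: collect a `-v` flag for each model root that actually exists on the host, so whichever path `MODEL_LOCAL_PATH` resolves to is also visible inside the container. The helper name `build_mount_flags` is hypothetical, not from the repo.

```shell
# Hypothetical helper: emit -v host:container bind-mount flags only for
# directories that exist on this runner, so the container sees the same
# model roots (/data, /models) the script probes on the host.
build_mount_flags() {
  flags=""
  for root in "$@"; do
    if [ -d "$root" ]; then
      flags="$flags -v $root:$root"
    fi
  done
  printf '%s\n' "$flags"
}

# Example: mount whichever of /data and /models is present.
MOUNT_FLAGS=$(build_mount_flags /data /models)
echo "MOUNT_FLAGS='$MOUNT_FLAGS'"
# The flags would then be spliced into the docker run invocation; note that
# MOUNT_FLAGS may be empty if neither directory exists.
```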
grep -c outputs '0' to stdout and then exits with code 1 when no matches are found. The '|| echo 0' fallback then outputs a second '0', making LOAD_COUNT='0\n0', which breaks the integer comparison. Use '|| true' to suppress the error exit and default to 0 if empty.
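A quick shell reproduction of the pitfall described above; the log file is a throwaway temp file, and "Model loaded" stands in for whatever marker the real script greps for.

```shell
# grep -c prints "0" AND exits 1 when there are zero matches, so an
# `|| echo 0` fallback captures a second line alongside the first "0".
log=$(mktemp)
printf 'starting up\nready\n' > "$log"

BAD=$(grep -c "Model loaded" "$log" || echo 0)   # captures "0" twice
GOOD=$(grep -c "Model loaded" "$log" || true)    # captures a single "0"

echo "BAD has $(printf '%s\n' "$BAD" | wc -l) line(s); GOOD='$GOOD'"
```

An integer comparison like `[ "$BAD" -gt 0 ]` fails outright on the two-line value, which is exactly the breakage the comment flags.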
Move all analysis logic into gpu_load_test.sh and remove redundant analysis step from the workflow. The script now handles GPU status check, model loading, timing analysis, and detailed diagnostics (including problem analysis and recommended actions for high variance).
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
```
docker kill $(docker ps -q) will terminate all containers running on the self-hosted runner, including unrelated long-lived services/containers that might be required on that machine. Please scope cleanup to containers created by this workflow (e.g., stop/rm only atom_inference, or filter by a workflow-specific label) rather than killing everything.
Suggested change:

```diff
-- name: Kill all Docker containers and clean up workspace
-  run: |
-    echo "=== Cleaning up containers on $(hostname) ==="
-    containers=$(docker ps -q)
+- name: Kill workflow Docker containers and clean up workspace
+  run: |
+    echo "=== Cleaning up workflow containers on $(hostname) ==="
+    containers=$(docker ps -q --filter "name=atom_inference")
```
```shell
MOUNT_FLAGS=""
[ -d "/data" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /data:/data"
[ -d "/models" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /models:/models"
docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
The workspace cleanup command rm -rf /workspace/* will not remove dotfiles (e.g., .git, .github, .cache), so state can leak between runs on self-hosted runners. Consider using a cleanup that also removes dotfiles (or rely on actions/checkout with clean: true/git clean -ffdx) so each run starts from a truly clean workspace.
Suggested change:

```diff
-docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
+docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/* /workspace/.[!.]* /workspace/..?*" || true
```
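A small demonstration of why the plain glob leaks dotfiles and how the three-glob pattern catches them. The paths are throwaway temp directories, not the real workspace.

```shell
# `dir/*` never matches names starting with ".", so dotfiles survive rm.
# Adding `.[!.]*` (dot + non-dot + anything) and `..?*` (dot-dot + more)
# covers hidden entries without ever matching "." or ".." themselves.
d=$(mktemp -d)
touch "$d/file" "$d/.git_marker"

rm -rf "$d"/*                           # plain glob: dotfile survives
LEAKED=$(ls -A "$d")                    # still lists .git_marker

rm -rf "$d"/* "$d"/.[!.]* "$d"/..?*     # extended globs remove it
REMAINING=$(ls -A "$d")                 # empty now

echo "leaked: '$LEAKED' / remaining: '$REMAINING'"
```

Unmatched globs are passed to rm literally, but `rm -f` stays silent and exits 0 for nonexistent operands, so the extra patterns are harmless when no dotfiles exist.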
```diff
  INPUT="${{ inputs.runners || 'all' }}"
  if [ "$INPUT" = "all" ]; then
-   MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+   MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
This workflow now targets new runner labels (mia1-p01-g33, g34, etc.). .github/runner-config.yml explicitly says to update the runner→GPU mapping when runner labels in workflows change, but it currently only lists atom-mi355-8gpu.predownload and linux-atom-mi355-*. Please add entries for these new labels (or adjust the workflow to use existing mapped labels) so downstream tooling stays accurate.
Suggested change:

```diff
-MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
+MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"atom-mi355-8gpu.predownload"},{"runner":"linux-atom-mi355-8gpu","label":"linux-atom-mi355-8gpu"}]'
```
```shell
echo "═══════════════════════════════════════════════════════"
echo ""
echo "1. 🔄 RUN TEST AGAIN to see if same GPUs are slow:"
echo "   ./test_gpu_load.sh"
```
The diagnostic output suggests re-running ./test_gpu_load.sh, but this repository script is .github/scripts/gpu_load_test.sh (and the workflow executes that). Please update the printed command/path to the actual script name/location to avoid confusion during manual troubleshooting.
Suggested change:

```diff
-echo "   ./test_gpu_load.sh"
+echo "   ./.github/scripts/gpu_load_test.sh"
```
```diff
  # Identify slow GPUs
- SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
+ SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
```
awk is passed -v max="$MAX_TIME" here, but max is never referenced in the awk program. Please drop the unused variable to avoid confusion (or use it if it was intended for the threshold logic).
Suggested change:

```diff
-SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
+SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
```
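For reference, a self-contained sketch of a min-relative threshold like the one the awk program appears to implement. The 1.5x slack factor and the "gpu-index load-time" input format are assumptions for illustration, not taken from the repo.

```shell
# Hypothetical data: GPU index and model-load time in seconds.
LOAD_DATA='0 12.1
1 12.3
2 19.8'
MIN_TIME=12.1

# Flag GPUs whose load time exceeds 1.5x the fastest GPU's time; only
# `min` is passed in, since that is all the threshold needs.
SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '$2 > min * 1.5 { print $1 }')
echo "slow GPUs: $SLOW_GPUS"
```

Passing an unused `-v max=...` would not change this output, which is why the review suggests dropping it.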
Summary

- Replaced the shared atom-mi355-8gpu.predownload runner label with per-machine runner labels.

Follows up on #298